⭐ 📅 📃 👉 📖 🤦‍♂️ 🖖 🤓


1 Captain’s log

Star date 71676.31. Our mission is to use R statistical software to extract star dates mentioned in the captain’s log from the scripts of Star Trek: The Next Generation and observe their progression over the course of the show’s seven seasons. There appears to be some mismatch in the frequency of digits after the decimal point – could this indicate poor abillity to choose random numbers? Or something more sinister? We shall venture deep into uncharted territory for answers…

2 Make it so!

Download all the scripts from Star Trek: The Next Generation from the Star Trek Minutiae website. This is provided as a zipped folder containing 276 text files numbered consecutively from 102 to 277.

3 Energise!

Ready the workspace by loading the packages we’ll need for data manipulation.

library(readr)  # read files
library(purrr)  # iterate functions over files
library(stringr)  # manipulate strings
library(dplyr)  # data manipulation and pipe opeartor (%>%)

4 Lieutenant Commander Data

We’re going to extract the content of the the text files using the read_lines() function from the readr package. We’ll iterate over each file with the map() function from the purrr package to read them into a list object where each element is a script.

scripts <- purrr::map(
  list.files(  # create vector of filepath strings to each file
    "data/scripts",  # file location of the text files 
    full.names = TRUE  # e.g. "data/scripts/102.txt"
    ),
  readr::read_lines # read the content from each filepath
  )

We can take a look at some example lines ([17:34]) from the title page of the first script (element [[1]]).

scripts[[1]][17:34]
##  [1] "                STAR TREK: THE NEXT GENERATION "        
##  [2] "                              "                         
##  [3] "                    \"Encounter at Farpoint\" "         
##  [4] "                              "                         
##  [5] "                              by "                      
##  [6] "                         D.C. Fontana "                 
##  [7] "                              and "                     
##  [8] "                       Gene Roddenberry "               
##  [9] ""                                                       
## [10] ""                                                       
## [11] "This script is not for publicaion or reproduction."     
## [12] "No one is authorized to dispose of the same. If lost or"
## [13] "destroyed, please notify the Script Department."        
## [14] ""                                                       
## [15] ""                                                       
## [16] "                         FINAL DRAFT"                   
## [17] ""                                                       
## [18] "                        April 13, 1987"

Our first example of a star date is in the Captain’s log voiceover in lines 46 to 50 of the first script.

scripts[[1]][46:50]
## [1] "\t\t\t\t\tPICARD V.O."                 
## [2] "\t\t\tCaptain's log, stardate 42353.7."
## [3] "\t\t\tOur destination is planet Cygnus"
## [4] "\t\t\tIV, beyond which lies the great" 
## [5] "\t\t\tunexplored mass of the galaxy."

5 Engage!

We want to extract stardate strings from each script. These are in the form XXXXX.X, where ‘X’ is a digit. We want to capture the stardates mentioned as part of the captain’s log voiceover where possible, so we can start our search pattern with ‘date’. This will help us avoid matching to strings that have a stardate-like pattern but aren’t stardates

We can extract these with str_extract_all() from the stringr package, using a regex (regular expression). Our regex is written date[:space:][[:digit:]\\.[:digit:]]{7}. This means ‘find a string that starts with the word date followed by a space (date), which is followed by a string that contains digits ([:digit:]) with a period (\\.) inside, with a total length of seven characters ({7})’.

This will provide a list object where each element contains the regex-matched string for a script.

stardate_extract <- stringr::str_extract_all(
  scripts,  # location from which to extract
  pattern = "date[:space:][[:digit:]\\.[:digit:]]{7}"  # regex
)

head(stardate_extract)  # see the first few list elements
## [[1]]
## [1] "date 42353.7" "date 42354.1" "date 42354.2" "date 42354.7"
## [5] "date 42372.5"
## 
## [[2]]
## [1] "date 41209.2" "date 41209.3"
## 
## [[3]]
## [1] "date 41235.2" "date 41235.3"
## 
## [[4]]
## [1] "date 41294.5" "date 41294.7"
## 
## [[5]]
## [1] "date 41263.1" "date 41263.2" "date 41263.3" "date 41263.4"
## 
## [[6]]
## [1] "date 41194.6" "date 41194.8"

We’re now going to tody the data to:

  • turn the list into a dataframe (tibble::enframe()) with one row per episode
  • turn this into a dataframe with one row per stardate (tidyr::unnest())
  • rename the columns ‘episode’ and ‘stardate’ (dplyr::transmute()) and remove the instances of the string ‘date’ (stringr::str_replace())
  • create a season column that manually applies the season number to each row depending on its episode number (dplyr::mutate(case_when()))
  • remove strings not in the form XXXXX.X (dplyr::if_else())
  • remove any NAs (dplyr::filter())
stardate_tidy <- stardate_extract %>% 
  tibble::enframe() %>% 
  tidyr::unnest() %>% 
  dplyr::transmute(
    episode = name,
    stardate = stringr::str_replace(
      string = value,
      pattern = "date ",
      replacement = ""
    )
  ) %>% 
  dplyr::mutate(
    season = as.character(
      case_when(
        episode %in% 1:25 ~ "1",
        episode %in% 26:47 ~ "2",
        episode %in% 48:73 ~ "3",
        episode %in% 74:99 ~ "4",
        episode %in% 100:125 ~ "5",
        episode %in% 126:151 ~ "6",
        episode %in% 152:176 ~ "7"
      )
    ),
    stardate = as.numeric(
      dplyr::if_else(
        condition = stardate %in% c("41148..", "40052..", "37650.."),
        true = "NA",
        false = stardate
      )
    )
  ) %>% 
  dplyr::filter(!is.na(stardate))

stardate_tidy
## # A tibble: 263 x 3
##    episode stardate season
##      <int>    <dbl> <chr> 
##  1       1   42354. 1     
##  2       1   42354. 1     
##  3       1   42354. 1     
##  4       1   42355. 1     
##  5       1   42372. 1     
##  6       2   41209. 1     
##  7       2   41209. 1     
##  8       3   41235. 1     
##  9       3   41235. 1     
## 10       4   41294. 1     
## # ... with 253 more rows

6 On screen!

Let’s visualise the stardates by episode.

We can make this interactive using plot

We can use the theme

library(ggplot2)
library(plotly)
library(ggsci)
library(ggthemes)

stardate_dotplot <- stardate_tidy %>% 
  ggplot2::ggplot() +
  geom_point(aes(x = episode, y = stardate, color = season)) +
  labs(title = "Stardates are almost but not quite chronological") +
  theme_solarized_2(light = FALSE) + 
  scale_color_startrek()

plotly::ggplotly(
  stardate_dotplot,
  tooltip = c("stardate", "episode", "season")
)

7 Enhance

Extract them.

stardate_tidy_decimal <- stardate_tidy %>% 
  mutate(
    stardate_decimal = as.numeric(
      str_sub(
        as.character(stardate),
        7,
        7
      )
    ),
    stardate_decimal = ifelse(
      is.na(stardate_decimal),
      0,
      stardate_decimal
    )
  ) %>% 
  select(season, episode, stardate, stardate_decimal)

Datatable of them.

library(DT)

stardate_tidy_decimal %>% 
  mutate(season = as.factor(season)) %>% 
  DT::datatable(
    filter = "top",
    extensions = 'Buttons',
      options = list(
        autoWidth = TRUE,  # column width consistent when making selections
        dom = "Blfrtip",
        buttons = 
          list("copy", list(
            extend = "collection",
            buttons = c("csv", "excel", "pdf"),
            text = "Download"
          ) 
          ),
        # customize the length menu
        lengthMenu = list(
          c(10, 25, 50, -1), # declare values
          c(10, 25, 50, "All") # declare titles
        ), # end of lengthMenu customization
        pageLength = 10
      )
    )

Do a barplot.

stardate_tidy_decimal %>% 
  ggplot2::ggplot() +
  geom_bar(aes(as.character(stardate_decimal)), fill = "#CC0C00FF") +
  labs(
   title = "Decimals one to three are most frequent and zero the least frequent",
    x = "stardate decimal value"
  ) +
  theme_dark() +
  theme_solarized_2(light = FALSE)

8 Belay that

stardate_tidy_decimal %>% 
  ggplot2::ggplot() +
  geom_bar(
    aes(as.character(stardate_decimal)),
    fill= c(
      rep("#CC0C00FF", 10),
      rep("#5C88DAFF", 9),
      rep("#84BD00FF", 10),
      rep("#FFCD00FF", 9),
      rep("#7C878EFF", 10),
      rep("#00B5E2FF", 8),
      rep("#00AF66FF", 8)
    )
  ) +
  labs(
    title = "There's a similar pattern of decimal stardate frequency across seasons",
    x = "stardate decimal value"
  ) +
  facet_wrap(~ season) +
  theme_solarized_2(light = FALSE) + 
  scale_color_startrek()

9 Speculate!

11 R information

sessionInfo()
## R version 3.4.3 (2017-11-30)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS High Sierra 10.13.3
## 
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] DT_0.4.5           ggthemes_3.4.0     ggsci_2.8         
##  [4] plotly_4.7.1       ggplot2_2.2.1.9000 bindrcpp_0.2      
##  [7] dplyr_0.7.4        stringr_1.2.0      purrr_0.2.4       
## [10] readr_1.1.1        emo_0.0.0.9000    
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.15        pillar_1.2.1        compiler_3.4.3     
##  [4] plyr_1.8.4          bindr_0.1           tools_3.4.3        
##  [7] digest_0.6.15       viridisLite_0.3.0   jsonlite_1.5       
## [10] lubridate_1.7.2     evaluate_0.10.1     tibble_1.4.2       
## [13] gtable_0.2.0        pkgconfig_2.0.1     rlang_0.2.0        
## [16] shiny_1.0.5         cli_1.0.0           crosstalk_1.0.1    
## [19] yaml_2.1.18         httr_1.3.1          withr_2.1.1.9000   
## [22] knitr_1.18          htmlwidgets_1.0     hms_0.3            
## [25] rprojroot_1.2       grid_3.4.3          data.table_1.10.4-2
## [28] glue_1.2.0          R6_2.2.2            rmarkdown_1.6      
## [31] tidyr_0.7.2         magrittr_1.5        backports_1.1.1    
## [34] scales_0.5.0.9000   htmltools_0.3.6     assertthat_0.2.0   
## [37] xtable_1.8-2        mime_0.5            colorspace_1.3-2   
## [40] httpuv_1.3.5        labeling_0.3        utf8_1.1.3         
## [43] stringi_1.1.6       lazyeval_0.2.1      munsell_0.4.3      
## [46] crayon_1.3.4

  1. The star date for today’s date (7 March 2018) as calculated using the trekguide.com method